Abstract. We study the problem of constructing phylogenetic trees for 
a given set of species. The problem is formulated as that of finding a 
minimum Steiner tree on n points over the Boolean hypercube of dimen- 
sion d. It is known that an optimal tree can be found in linear time Q] 
if the given dataset has a perfect phylogeny, i.e. cost of the optimal phy- 
logeny is exactly d. Moreover, if the data has a near-perfect phylogeny, 
i.e. the cost of the optimal Steiner tree is d + q, it is known [2] that an 
exact solution can be found in running time which is polynomial in the 
number of species and d, yet exponential in q. In this work, we give a 
polynomial-time algorithm (in both d and q) that finds a phylogenetic 
tree of cost $d + O(q^2)$. This provides the best guarantees known, namely
a $(1 + o(1))$-approximation, for the case $\log d \ll q \ll \sqrt{d}$, broadening
the range of settings for which near-optimal solutions can be efficiently 
found. We also discuss the motivation and reasoning for studying such 
additive approximations. 



1 Introduction 

Phylogenetics, a subfield of computational biology, aims to construct simple
and accurate descriptions of evolutionary history. These descriptions are rep- 
resented as evolutionary trees for a given set of species, each of which is rep- 
resented by some set of features ( |3I4| ). A typical choice for these features are 
single nucleotide polymorphisms (SNPs), binary indicator variables for common 
mutations found in DNA [5Itj] ; see, for example, |2lll7l8l9ll0j . This challenging 
problem has attracted much attention in recent years, with progress in studying 
various computational formulations of this problem ( |3llll2llll2l7j ). The prob- 
lem is often posed as that of constructing the most parsimonious tree induced 
by the set of species. 

Formally, a phylogeny or a phylogenetic tree for a set $C$ of $n$ species, each
represented by a string (called a taxon) of length $d$ over a finite alphabet $\Sigma$, is an
unrooted tree $T = (V, E)$ such that $C \subseteq V \subseteq \Sigma^d$. Given a distance metric $\mu$ over
$\Sigma^d$, we define the cost of $T$ as $\sum_{(u,v) \in E} \mu(u, v)$. The tree of maximum parsimony
for a dataset is the tree which minimizes this cost with respect to the Hamming
metric; i.e., it is the optimum Steiner tree for the set $C$ under this metric.
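The cost definition can be made concrete with a small sketch. It assumes taxa are stored as tuples of 0/1 integers and a tree as a list of node pairs; the names `hamming` and `tree_cost` are illustrative and do not come from the paper.

```python
from typing import Sequence, Tuple

Node = Tuple[int, ...]  # assumed encoding: one 0/1 entry per coordinate


def hamming(u: Node, v: Node) -> int:
    """Hamming distance between two binary strings of equal length."""
    if len(u) != len(v):
        raise ValueError("nodes must have the same dimension")
    return sum(a != b for a, b in zip(u, v))


def tree_cost(edges: Sequence[Tuple[Node, Node]]) -> int:
    """Cost of a tree: the sum of Hamming distances over its edges."""
    return sum(hamming(u, v) for u, v in edges)


# Four terminals over d = 3 coordinates, joined in a star around (0, 0, 0).
terminals = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
star_edges = [((0, 0, 0), t) for t in terminals[1:]]
star_cost = tree_cost(star_edges)
```

Here every coordinate flips exactly once, so the star has cost equal to $d = 3$, which is the perfect-phylogeny case discussed in the text.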

The Steiner tree problem is known to be NP-hard in general [13], and remains
NP-hard even in the case of a binary alphabet with the metric induced
by the Hamming distance [14]. Extensive recent work, both experimental and
theoretical, has focused on the binary character set with the Hamming metric 
( |3I2I1I12I7I4I15I1"B] ). This version of the phylogeny problem will also be the focus 
of this paper. 

A phylogeny is called perfect if each coordinate $i \in [d]$ flips exactly once in the
tree (representing a single mutation of $i$ amongst the set of species). If a dataset
admits a perfect phylogeny, an optimal tree can be constructed in polynomial 
time |17| (even linear time, in the case where the alphabet is binary [3]). In 
this work, we investigate near-perfect phylogenies: instances whose optimal
phylogenetic tree has cost $d + q$, where $q \ll d$. Near-perfect phylogenies have
been studied in theoretical f |11I2I12I16| ) and experimental settings ([H])- The 
work of |lll2U2lirj] has given a series of randomized algorithms which find the 
optimal phylogeny in running time polynomial in n and d but exponential in q. 
Clearly, when $q = \omega(\log d)$, these algorithms are not tractable.

An alternative approach for finding a phylogenetic tree of low cost is to use a 
generic Steiner tree approximation algorithm. The best current such algorithm 
yields a tree of cost at most $1.39(d + q)$ [18] (we comment that the exponential size
of the explicit hypercube with respect to its small representation size requires 
one implement such an algorithm using techniques devised especially for the 
hypercube, e.g. Alon et al. [7].) However, notice that for moderate q (e.g., for 
q = polylog(d)), the excess of this tree — meaning the difference between its cost 
and d — may be extremely large compared to the excess q of the optimal tree. In 
such cases, one would much prefer an algorithm whose excess could be written 
as a function of q only. 

In this work, we present a randomized poly$(n, d, q)$-time algorithm that finds
a phylogenetic tree of cost $d + O(q^2)$.

Theorem 1. Given a set $C \subseteq \{0, 1\}^d$ of $n$ terminals, such that the optimal
phylogeny of $C$ has cost $d + q$, there exists a randomized poly$(n, d, q)$-time algorithm
that finds a phylogenetic tree of cost $d + O(q^2)$ w.p. $\geq 1/2$.

Note that Theorem 1 provides a substantial improvement over prior work
for the case that $\log d \ll q \ll \sqrt{d}$. In this range, the exact algorithms are no
longer tractable, and the multiplicative approximations yield significantly worse
bounds. Alternatively, viewed as a multiplicative guarantee, in this range our tree
is within a $1 + o(1)$ factor of optimal. To the best of our knowledge, this is the
first work to give an additive poly-time approximation to either the phylogeny
problem or any (non-trivial) setting of the Steiner tree problem. One immediate
question, which remains open, is whether our results can be improved to $d + o(q^2)$
or perhaps even to $d + O(q)$.

The rest of the paper is organized as follows. After surveying related work in
Section 1.1, we detail notation and preliminaries in Section 2. The presentation
of our algorithm is partitioned into two parts. In Section 3 we present the
algorithm for the case where no pair of coordinates is identical over all terminals
(formal definition there). In Section 4 we alter the algorithm for the simple case,
in a nontrivial way, so that the modified algorithm finds a low-cost phylogeny
for any dataset. We conclude in Section 5 with a discussion, motivating the
problem of near-perfect phylogeny tree from a different perspective, and present 
open problems for future research. 

1.1 Related Work 

As mentioned in the introduction, the problem of constructing an optimal
phylogeny is NP-complete even when restricted to binary alphabets [T3]. Schwartz
et al. [TT] give an algorithm based on an Integer Linear Programming (ILP) 
formulation to solve the multi-state problem optimally, and show experimentally
that the algorithm is efficient on small instances. Perfect phylogenies (datasets
which admit a tree in which any coordinate changes exactly once) have optimal 
parsimony trees which can be constructed in linear time in the binary case [T] 
and in polynomial time for a fixed alphabet |12| . Unfortunately, finding the per- 
fect phylogeny for arbitrary alphabets is NP-hard [19] . Recent work [2] gives an 
algorithm to construct optimal phylogenetic trees for binary, near-perfect
phylogenies (where only a small number of coordinates mutate more than once in
the optimal tree). However, the running time of the algorithm presented in their 
work [5] is exponential in the number of additional mutations. 

There has also been a lot of work on computing multiplicative approximations
to the Steiner tree problem. A Minimum Spanning Tree (MST) over the set of
terminals achieves an approximation ratio of 2, and a long line of work has led to
the current best bound of 1.39 [20, 21, 22, 23, 24, 8, 25, 9, 10, 18]. The more recent of
these papers use a result due to Borchers and Du [55] showing that an optimal
Steiner tree can be approximated to arbitrary precision using $k$-restricted Steiner
trees.

Some of these approximations to the Steiner tree problem are not immediately
extendable to the problem of constructing phylogenetic trees. This is
because the size of the vertex set for the phylogeny problem is exponential in
$d$ (there are $2^d$ vertices in the hypercube). If an algorithm works on an explicit
representation of the graph $G$ defined by the hypercube, then it does not solve
the phylogeny problem in polynomial time. However, the line of work started by
Robins and Zelikovsky [9, 10] used the notion of $k$-restricted Steiner trees, which
can be efficiently implemented on the hypercube. In particular, Alon et al. [7]
showed that in finding the optimal $k$-restricted component for a given set of $k$
terminals, it is sufficient to only consider topologies with the given $k$ terminals
at the leaves. Using this, they were able to extend that work to achieve a 1.55
approximation ratio for the maximum parsimony problem, and a 16/9 approximation
for maximum likelihood. Byrka et al. [18] considered a new LP relaxation
to the $k$-restricted Steiner tree problem and achieved an approximation ratio of
1.39, which can be combined with the topological argument from Alon et al. [7]
to achieve the same ratio for phylogenies.

2 Notation and Preliminaries 

Our dataset $C \subseteq \{0, 1\}^d$ consists of $n$ terminals over $d$ binary coordinates. A
Steiner tree (or phylogeny) over $C$ consists of a tree $T$ on the hypercube that
spans $C$ (plus possibly additional Steiner nodes), where we label each edge $e$ in
$T$ with the index $i \in \{1, \ldots, d\}$ of the coordinate flipped on edge $e$. The cost
of such a Steiner tree is the number of edges in the tree. Given a collection of
datasets $\mathcal{P} = \{P_1, P_2, \ldots, P_t\}$ of subsets of $C$, we define the Steiner forest problem as the
problem of finding a minimal Steiner tree on every $P \in \mathcal{P}$ separately. We refer to
such a collection as a partition from now on, even though it may contain a subset
of the original terminal set $C$.

In this work, we consider instances $C$ whose minimum Steiner tree has cost
$d + q$, and think of $q = o(\sqrt{d})$ (otherwise, any off-the-shelf constant approximation
algorithm for the Steiner problem gives a solution of cost $\leq d + O(q^2)$). We fix $T$
to be some optimal Steiner tree. By optimality, all leaves in T must be terminals, 
whereas the internal nodes of T may be either terminals or non-terminals (non- 
terminals are called Steiner nodes). We define a coordinate i to be good if exactly 
one edge in T is labeled i, and bad if two or more edges in T are labeled with i. 
We may assume all d coordinates appear in the tree, otherwise, some coordinates 
in C are fixed and so the dimensionality of the problem is less than d. Therefore, 
at most q coordinates are bad (each bad coordinate flips at least twice and thus 
adds a cost of at least 2 to the tree). 
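As an illustration of the good/bad distinction, the following sketch labels each hypercube edge with the coordinate it flips and counts how often each label occurs. The helper names (`edge_label`, `classify_coordinates`) and the tuple encoding of nodes are assumptions made for this example only.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Node = Tuple[int, ...]
Edge = Tuple[Node, Node]


def edge_label(u: Node, v: Node) -> int:
    """Index of the single coordinate flipped along a hypercube edge."""
    diff = [k for k, (a, b) in enumerate(zip(u, v)) if a != b]
    if len(diff) != 1:
        raise ValueError("a hypercube edge flips exactly one coordinate")
    return diff[0]


def classify_coordinates(edges: Sequence[Edge], d: int) -> Tuple[List[int], List[int]]:
    """Return (good, bad): coordinates labeling one edge, resp. two or more."""
    counts = Counter(edge_label(u, v) for u, v in edges)
    good = [i for i in range(d) if counts[i] == 1]
    bad = [i for i in range(d) if counts[i] >= 2]
    return good, bad


# All four points of the 2-dimensional hypercube are terminals, so any
# spanning path is an optimal tree: cost 3 = d + q with d = 2 and q = 1.
path = [((0, 0), (1, 0)), ((1, 0), (1, 1)), ((1, 1), (0, 1))]
good, bad = classify_coordinates(path, d=2)
excess_q = len(path) - 2
```

In this toy instance coordinate 0 flips twice and is bad, coordinate 1 flips once and is good, and the number of bad coordinates does not exceed the excess $q = 1$.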

Given a coordinate $i$ of a set of terminals $P$, we define an $i$-cut as the partition
$P_0 = \{x \in P : x_i = 0\}$ and $P_1 = \{x \in P : x_i = 1\}$. We call two coordinates
$i \neq j$ interchangeable if they define the same cut. We now present the following
basic facts which are easy to verify (see [2] for proofs).
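A minimal sketch of the $i$-cut and of the interchangeability test follows. It assumes terminals are tuples of 0/1 integers, compares cuts as unordered pairs (so complementary coordinates also count as defining the same cut), and treats constant coordinates as not interchangeable; that last convention is an assumption of this sketch, not a statement from the paper.

```python
from typing import FrozenSet, Sequence, Tuple

Node = Tuple[int, ...]


def i_cut(P: Sequence[Node], i: int) -> Tuple[FrozenSet[Node], FrozenSet[Node]]:
    """Split P into (P_0, P_1) according to the value of coordinate i."""
    p0 = frozenset(x for x in P if x[i] == 0)
    p1 = frozenset(x for x in P if x[i] == 1)
    return p0, p1


def interchangeable(P: Sequence[Node], i: int, j: int) -> bool:
    """True if distinct non-constant coordinates i and j define the same cut."""
    if i == j:
        return False
    cut_i, cut_j = i_cut(P, i), i_cut(P, j)
    # Assumption of this sketch: a constant coordinate (one empty side)
    # is never reported as interchangeable.
    if not all(cut_i) or not all(cut_j):
        return False
    # Compare as unordered pairs, i.e. up to renaming of the two sides.
    return frozenset(cut_i) == frozenset(cut_j)


P = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
Q = [(0, 1), (1, 0)]
```

On `P`, coordinates 0 and 1 always agree and are interchangeable, while coordinate 2 defines a different cut. On `Q`, the two coordinates are complementary and still define the same cut.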

Fact 1 

1. Let $S$ be a set of interchangeable coordinates. Then all coordinates in $S$
appear together in the optimal tree $T$, adjacent to one another. That is, in $T$
there are paths s.t. for each path: all of its edges are labeled by some $i \in S$,
all coordinates in $S$ have an edge on the path, and all internal nodes on the
paths aren't terminals and have degree 2. On these paths, any reordering of the
$S$-labeled edges yields an equivalent optimal tree.

2. For any two good coordinates, $i \neq j$, one side of the $i$-cut is contained within
one side of the $j$-cut. Equivalently, there exist values $b_j$ such that all terminals
on one side of the $i$-cut have their $j$th coordinate set to $b_j$.

3. Fix any good coordinate $i$ and let $j$ be a good coordinate such that all terminals
on one side of the $i$-cut have their $j$th coordinate set to $b_j$. Then both
endpoints of the edge labeled $i$ have their $j$th coordinate set to $b_j$.

4. A good coordinate $i$ and a bad coordinate $i'$ cannot define the same cut.

It immediately follows from Fact 1 that for a given good coordinate $i$ one can
efficiently reconstruct the endpoints of the edge on which $i$ mutates, except for at
most $q$ coordinates. This leads us to the following definition. Given $i$, we denote
by $D^i$ the set of all coordinates that are fixed to a constant value $v_j$ on at least one
side of the $i$-cut (different coordinates may be fixed on different sides), and we
denote by $b^i$ the vector of the corresponding values, i.e. the $v_j$'s, of the coordinates in
$D^i$. The pair $(D^i, b^i)$ is called the pattern of coordinate $i$. The set of terminals
that match the pattern of $i$ is the set $P_{b^i} = \{x \in P : \forall j \in D^i, x_j = b^i_j\}$.
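The pattern of a coordinate can be computed directly from the definition. The sketch below is one possible reading, under two assumptions that the text does not spell out: coordinate $i$ itself is left out of $D^i$, and when a coordinate is constant on both sides the value on the $x_i = 0$ side is used. The function names are illustrative.

```python
from typing import Dict, Sequence, Tuple

Node = Tuple[int, ...]


def pattern(P: Sequence[Node], i: int) -> Dict[int, int]:
    """Map each coordinate j in D^i to its fixed value b^i_j."""
    d = len(P[0])
    sides = ([x for x in P if x[i] == 0], [x for x in P if x[i] == 1])
    pat: Dict[int, int] = {}
    for j in range(d):
        if j == i:
            continue  # assumption: i itself is not part of its own pattern
        for side in sides:
            values = {x[j] for x in side}
            if len(values) == 1:  # j is constant on this side of the i-cut
                pat[j] = values.pop()
                break
    return pat


def matches(x: Node, pat: Dict[int, int]) -> bool:
    """True if x agrees with the pattern on every coordinate in D^i."""
    return all(x[j] == b for j, b in pat.items())


# Terminals lying on the path 000 - 100 - 110 - 111; coordinate 1 flips on
# the middle edge, between 100 and 110.
P = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
pat = pattern(P, 1)
matching = [x for x in P if matches(x, pat)]
```

For this path the pattern of coordinate 1 fixes coordinate 0 to 1 and coordinate 2 to 0, and the only terminals matching it are the two endpoints of the edge labeled 1.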

3 A Simple Case: Each Coordinate Determines a Distinct 
Cut 

To show the main ideas behind our algorithm, we first discuss a special case 
in which no two coordinates i and j define the same cut on the terminal set 
C. Algorithms for constructing phylogenetic trees often make this assumption 
as they preprocess C by contracting any pair of interchangeable coordinates. 
However, in our case such contractions are problematic, as we discuss in the 
next section. So in Section 4, when we deal with the general case, we deal with
interchangeable coordinates in a non-trivial fashion. 

3.1 Basic Building Blocks 

We now turn to the description of our algorithm. On a high level it is motivated 
by the notion of maintaining a proper partition of the terminals. 

Definition 1. Call a partition $\mathcal{P}$ proper if the forest produced by restricting the
optimal tree $T$ to the components $P \in \mathcal{P}$ is composed of edge disjoint trees.

Equivalently, the path in $T$ between two nodes $x$ and $y$ in the same component
$P$ of $\mathcal{P}$ does not pass through any node $x'$ in any different component $P'$ of $\mathcal{P}$.
Clearly, our initial partition, $\mathcal{P} = \{C\}$, is proper. Our goal is to maintain a
proper partition of the current terminals while decreasing the dimensionality of 
the problem in each step. This is implemented by the two subroutines we now 
detail. 

Pluck a Leaf and Paste a Leaf. The first subroutine works by building the 
optimal phylogeny bottom-up, finding a good coordinate i adjacent to a leaf 
terminal t in the tree, and replacing t with its parent (t with i flipped) in the set 
of terminals. Observe that if i is a good coordinate, then this removes the only 
occurrence of i, leaving all terminals in our new dataset with a fixed i coordinate, 
thus reducing the dimensionality of the problem by 1. 

The matching subroutine to Pluck-a-leaf is Paste-a-leaf: if Pluck-a-leaf
succeeds and returns some $(x, \mathcal{P}')$, and we have found a Steiner forest for the
terminals in $\mathcal{P}'$, then Paste-a-leaf merely connects $x$ with $x^i$ by an edge labeled
$i$, then returns the resulting forest. (We omit the formal description.)



Pluck-a-leaf

input: A partition $\mathcal{P}$ of current terminals.

if there exists $P \in \mathcal{P}$ and $x \in P$ s.t. some coordinate $i$ is non-constant on $P$, but only
the terminal $x$ has $x_i = 0$ (or $x_i = 1$), then:

- Set $P' = P \setminus \{x\} \cup \{x^i\}$, where $x^i$ is identical to $x$ except for flipping $i$.

- Return $x$ and $\mathcal{P}' = \mathcal{P} \setminus \{P\} \cup \{P'\}$.
else fail.
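The Pluck-a-leaf box can be sketched in a few lines of Python. This is an illustrative reading rather than the authors' implementation: a partition is assumed to be a list of frozensets of 0/1 tuples, and the function additionally returns the plucked coordinate `i`, which the pseudocode leaves implicit.

```python
from typing import FrozenSet, List, Optional, Tuple

Node = Tuple[int, ...]
Partition = List[FrozenSet[Node]]


def pluck_a_leaf(partition: Partition) -> Optional[Tuple[Node, int, Partition]]:
    """Return (x, i, new_partition) if some leaf can be plucked, else None."""
    for idx, P in enumerate(partition):
        if len(P) < 2:
            continue  # every coordinate is constant on a single terminal
        d = len(next(iter(P)))
        for i in range(d):
            for bit in (0, 1):
                side = [x for x in P if x[i] == bit]
                # Exactly one terminal has x_i = bit, and since |P| >= 2
                # the coordinate is non-constant on P.
                if len(side) == 1:
                    x = side[0]
                    parent = x[:i] + (1 - x[i],) + x[i + 1:]
                    new_P = (P - {x}) | {parent}
                    new_partition = partition[:idx] + [new_P] + partition[idx + 1:]
                    return x, i, new_partition
    return None  # corresponds to "fail" in the pseudocode


start: Partition = [frozenset({(0, 0), (1, 0), (1, 1)})]
first = pluck_a_leaf(start)
```

On `start`, the terminal `(0, 0)` is the only one with coordinate 0 equal to 0, so it is replaced by its parent `(1, 0)` and coordinate 0 becomes constant on the component.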



Lemma 1. If $\mathcal{P}$ is a proper partition and Pluck-a-leaf succeeds, then $\mathcal{P}'$ is a
proper partition.

Proof (Sketch). Let $T[P]$ be the subtree in which $x$ resides. We claim that $x$ is
a leaf in $T[P]$, attached by an edge labeled $i$ to the rest of the terminals. If this
indeed is the case, then removing $i$ means removing a leaf-adjacent edge from
$T[P]$ which clearly leaves all components in the forest edge-disjoint.

Wlog $x$ lies on the $i = 0$ side of the cut. If $x$ isn't a leaf, then at least two
disjoint paths connect $x$ to two other terminals. Since $\mathcal{P}$ is proper, both these
terminals are in $P$. This means $T[P]$ crosses the $i$-cut twice, but then we can
replace $T[P]$ with an even less costly tree in which $i$ is flipped once, by projecting
the path between the two occurrences of $i$ onto the $i = 1$ side.

□ 

Observe that Lemma 1 holds only when the underlying alphabet of the problem
is binary. In particular, for a non-binary alphabet, such $x$ can be a non-leaf.

Split and Merge. When Pluck-a-leaf can no longer find leaves to pluck, 
we switch to the second subroutine, one that works by splitting the set of termi- 
nals into two disjoint sets, based on the value of the i-th coordinate. We would 
like to split our set of terminals according to the i-cut, and recurse on each side 
separately. But, in order to properly reconnect the two subproblems, we need 
to introduce the two endpoints of the $i$-labeled edge to their respective sides of
the i-cut. Our Split subroutine deals with one particular case in which these 
endpoints are easily identified. 



Split($i$)

input: A partition $\mathcal{P}$ of current terminals, a coordinate $i$ that is not constant on every
component of $\mathcal{P}$.

- Find a component $P$ on which $i$ isn't constant. Denote the $i$-cut of $P$ as $(P_0, P_1)$.

- Find $P_{b^i}$, the set of terminals that match the pattern of $i$.

- if there exists some $x$ which is the unique terminal that matches the pattern of $i$ in one
side of the cut (that is, if for some $x$ we have $P_{b^i} \cap P_0 = \{x\}$ or $P_{b^i} \cap P_1 = \{x\}$)

• Flip the $i$-th coordinate of $x$, and let $x^i$ be the resulting node.

• Add $x$ to its side of the $i$-cut, add $x^i$ to the other side of the cut.

• Return $x$, $x^i$ and $\mathcal{P}' = \mathcal{P} \setminus \{P\} \cup \{P_0, P_1\}$,
else fail.
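A sketch of Split on a single component is given below. It is a hedged illustration, not the paper's code: it works on one component $P$ rather than a whole partition, re-implements the pattern computation with the same tie-breaking assumptions as the definition suggests (coordinate $i$ excluded, the $x_i = 0$ side consulted first), and uses hypothetical names throughout.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Tuple

Node = Tuple[int, ...]


def pattern(P: Iterable[Node], i: int) -> Dict[int, int]:
    """Coordinates fixed on at least one side of the i-cut, with their values."""
    P = list(P)
    sides = ([x for x in P if x[i] == 0], [x for x in P if x[i] == 1])
    pat: Dict[int, int] = {}
    for j in range(len(P[0])):
        if j == i:
            continue  # assumption: i is excluded from its own pattern
        for side in sides:
            values = {x[j] for x in side}
            if len(values) == 1:
                pat[j] = values.pop()
                break
    return pat


def split(P: Iterable[Node], i: int) -> Optional[
        Tuple[Node, Node, FrozenSet[Node], FrozenSet[Node]]]:
    """Return (x, x_i, new_P0, new_P1), or None if Split fails."""
    P = frozenset(P)
    p0 = frozenset(x for x in P if x[i] == 0)
    p1 = frozenset(x for x in P if x[i] == 1)
    if not p0 or not p1:
        raise ValueError("coordinate i must be non-constant on P")
    pat = pattern(P, i)
    matching = {x for x in P if all(x[j] == b for j, b in pat.items())}
    for side in (p0, p1):
        hits = matching & side
        if len(hits) == 1:  # unique pattern-matching terminal on this side
            (x,) = hits
            x_i = x[:i] + (1 - x[i],) + x[i + 1:]  # x with coordinate i flipped
            if x[i] == 0:
                return x, x_i, p0 | {x}, p1 | {x_i}
            return x, x_i, p0 | {x_i}, p1 | {x}
    return None


# Terminals 000, 100, 111: the edge labeled 1 joins the terminal 100 to the
# Steiner node 110 in the tree 000 - 100 - 110 - 111.
result = split({(0, 0, 0), (1, 0, 0), (1, 1, 1)}, 1)
```

In this example the unique matching terminal is `(1, 0, 0)`, and the flipped node `(1, 1, 0)` is introduced into the other side of the cut even though it was not a terminal.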



The matching subroutine to Split is Merge: Assume Split succeeds and
returns some $(x, x^i, \mathcal{P}')$, and assume we have found a Steiner forest for the
terminals in $\mathcal{P}'$. Then Merge merely connects $x$ with $x^i$ by an edge labeled $i$, then
returns the resulting forest. (Again, formal description is omitted.)

Lemma 2. Assume $\mathcal{P}$ is a proper partition. Assume Split is called on a good
coordinate $i$ s.t. the edge labeled $i$ in $T$ has at least one endpoint which is a
terminal. Then the returned partition $\mathcal{P}'$ is proper.

Proof (Sketch). Since $\mathcal{P}$ is proper, the induced tree $T[P]$ is the only tree in
the forest that contains the $i$-labeled edge. The lemma then follows from showing
that $x$ and $x^i$ are the two endpoints of the $i$-labeled edge in $T[P]$. This follows from
the observation that the endpoints of the $i$-labeled edge must both match the
pattern of $i$. Let $u$ be an endpoint and wlog $u$ belongs to the $(i = 0)$-side of the
cut. On all coordinates that are fixed on the $(i = 0)$-side, $u$ obviously has the
right values. All coordinates that are fixed on the $(i = 1)$-side can only flip on
the $(i = 0)$-side, but only after traversing $u$, so $u$ has them set to the value fixed
on the $(i = 1)$-side.

□ 

3.2 The Algorithm 

We can now introduce our algorithm.

input: A partition $\mathcal{P}$ of current terminals. Initially, $\mathcal{P}$ is the singleton set $\mathcal{P} = \{C\}$.

1. if Pluck-a-leaf succeeds and returns $(x, \mathcal{P}')$

- recurse on $\mathcal{P}'$, then Paste-a-leaf $x$ back and return the resulting forest.

2. else-if the number of non-constant coordinates on $\mathcal{P}$ is at least $40q^2$

- Pick a non-constant coordinate $i$ u.a.r. and invoke Split($i$).

- if Split succeeds: recurse on $\mathcal{P}'$, then Merge $x$ and $x^i$, and return the resulting
forest; otherwise fail.

3. else

- For every $P \in \mathcal{P}$ find its MST, $T(P)$, and return the forest $\{T(P)\}$.



Fig. 1. Algorithm for the simple case 
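Step 3 of the algorithm falls back on a minimum spanning tree over each component under the Hamming metric. The sketch below uses Prim's algorithm on the complete graph of terminals; the choice of Prim and all names are assumptions of this example, since the paper does not prescribe a particular MST routine.

```python
from typing import Dict, List, Sequence, Tuple

Node = Tuple[int, ...]
Edge = Tuple[Node, Node]


def hamming(u: Node, v: Node) -> int:
    """Hamming distance between two binary strings of equal length."""
    return sum(a != b for a, b in zip(u, v))


def hamming_mst(points: Sequence[Node]) -> Tuple[List[Edge], int]:
    """Prim's algorithm on the complete graph weighted by Hamming distance."""
    pts = list(points)
    if not pts:
        return [], 0
    # best[k] = (distance from pts[k] to the current tree, index of closest tree node)
    best: Dict[int, Tuple[int, int]] = {
        k: (hamming(pts[0], pts[k]), 0) for k in range(1, len(pts))
    }
    edges: List[Edge] = []
    total = 0
    while best:
        k = min(best, key=lambda idx: best[idx][0])
        dist, parent = best.pop(k)
        edges.append((pts[parent], pts[k]))
        total += dist
        for m in best:  # relax distances through the newly added node
            dm = hamming(pts[k], pts[m])
            if dm < best[m][0]:
                best[m] = (dm, k)
    return edges, total


# Three terminals at pairwise distance 2. Adding the Steiner node (1, 1, 1)
# would connect them with cost 3, while the MST must pay 4.
terminals = [(1, 1, 0), (1, 0, 1), (0, 1, 1)]
mst_edges, mst_cost = hamming_mst(terminals)
steiner_cost = sum(hamming((1, 1, 1), t) for t in terminals)
```

The example shows the gap the analysis accounts for: the MST costs 4 against an optimal Steiner cost of 3, within the factor of 2 that the proof of Theorem 2 relies on.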

Theorem 2. With probability $\geq 1/2$, the algorithm in Figure 1 returns a tree
whose cost is at most $d + O(q^2)$.

In order to prove Theorem 2, fix an optimal phylogeny $T$ over our initial set
of terminals, and for any partition $\mathcal{P}$ our algorithm creates, denote by $T[\mathcal{P}]$ the
forest induced by $T$ on this partition. The proof of the theorem relies on the
following lemma.

Lemma 3. If $\mathcal{P}$ is a proper partition, then with probability $\geq 1 - (8q)^{-1}$, Split
is called on a good coordinate and succeeds. Furthermore, Split is executed at
most $4q$ times.



Proof (of Theorem 2). The proof follows from Lemmas 1 and 3. Since we start
with a proper partition, with probability at least $1 - (4q)(8q)^{-1} \geq 1/2$ we
keep recursing on proper partitions, until reaching the base of the recursion. By
the time the algorithm reaches the base of the recursion, the dimensionality of
the problem was reduced to $d' < 40q^2$, so the cost of the optimal Steiner forest
is at most $d' + q$. As MSTs give a 2-approximation to the optimal Steiner tree
problem, our forest is of cost $\leq 2(d' + q)$. Then, the algorithm reconnects the
forest, adding the coordinates (edges) the algorithm has removed in the first two
steps of the algorithm. Since the algorithm removed at most $d - d'$ edges, the
tree it outputs is of overall cost at most $d - d' + 2(d' + q) = d + d' + 2q \leq d + 40q^2 + 2q$.

□

Proof (of Lemma 3). Let $\mathcal{P}$ be the partition in the first iteration of the algorithm
for which Split was invoked, and assume $\mathcal{P}$ is proper. Thus, the forest $T[\mathcal{P}]$
contains disjoint components. We call any vertex in this forest of degree $\geq 3$
an internal split. Suppose we replace each internal split $v$ with $\deg(v)$ many
new vertices, each adjacent to one edge. This breaks the forest into a collection
of paths we call the path decomposition of the tree. In addition, remove from
this path decomposition all edges that are labeled with a bad coordinate to
obtain the good path decomposition. Denote the number of paths in the good
path decomposition as $t$.

First, we claim that any call to Split (on $\mathcal{P}$ or any partition succeeding
$\mathcal{P}$), on a coordinate $i$ which lies on a path of length $\geq 2$ in the abovementioned
decomposition, does not fail.

Assume Split was called on $i$ and denote its adjacent coordinate on the
path as $j$ (choose one arbitrarily if $i$ has two adjacent coordinates on its path),
and both are non-constant on $P \in \mathcal{P}$. Observe that our decomposition leaves
only good coordinates, so both $i$ and $j$ are good. Therefore, $j$ is fixed on one
side of the $i$-cut and $i$ is fixed on one side of the $j$-cut. It follows that there exist
binary values $b_i, b_j$ s.t. for every $x \in P$, if $x_i = b_i$ then $x_j = b_j$; and if $x_j = 1 - b_j$
then $x_i = 1 - b_i$. In fact, the only node on the entire tree for which $x_i = 1 - b_i$
and $x_j = b_j$ is the node connecting the $i$-edge and the $j$-edge. Recall that we
assume for the special case that $i$ and $j$ do not define the same cut. It follows that
the node between $i$ and $j$ has to be a terminal, so now we can use Lemma 2 and
deduce Split succeeds.

So, Split can either fail or return a non-proper partition only if it was
invoked either on a bad coordinate or on a good coordinate that lies on a path
of length 1 in our path decomposition. There are at most $q$ bad coordinates and
at most $t$ paths of length 1, so each call to Split fails w.p. $\leq \frac{q + t}{40q^2}$. Furthermore,
calling Split on a good edge $i$ lying on a path of length at least 2 results in
both of $i$'s endpoints as new leaves in their respective sides of the $i$-cut. As a result,
Pluck-a-leaf then completely unravels the path on which $i$ lies. Therefore, in
a successful run of the algorithm, Split is called no more than $t$ times. All that
remains is to bound $t$.

$t$ is the number of paths in the path decomposition of $\mathcal{P}$, a partition for which
Pluck-a-leaf failed to execute. Observe that if the forest $T[\mathcal{P}]$ had even a single
leaf connected to the rest of its tree by a good coordinate, then Pluck-a-leaf
would continue: such a leaf, by definition, is the only terminal on which the
good coordinate takes a certain value. It follows that $l$, the number of leaves in
$T[\mathcal{P}]$, is bounded by $2q$, the number of bad edges in $T$. Removing the internal
splits then leaves us with at most $2l$ paths; removing the bad coordinates' edges
adds at most $2q - l$ new paths (for every bad coordinate $k$ adjacent to a leaf,
removing $k$ does not create a new path). All in all, $t \leq 2l + 2q - l \leq 4q$. Therefore,
each call to Split has success probability $\geq 1 - \frac{5q}{40q^2} = 1 - \frac{1}{8q}$, and Split is
called at most $4q$ times.

□ 

4 The General Case: Interchangeable Coordinates May 
Exist 

Before describing the general case, let us briefly discuss why the conventional
way of initially contracting all interchangeable coordinates and applying the
algorithm from Section 3 might result in a tree of cost $d + \omega(q^2)$. The analysis
of the first two steps of the algorithm still holds. The problem lies in the base
of the recursion, where the algorithm runs the MST-based 2-approximation.
Indeed, the MST algorithm is invoked on $\leq 40q^2$ contracted coordinates, but
they correspond to $d$ original coordinates, and it is possible that $d \gg q^2$. So by
using any constant approximation on this entire forest, we may end with a tree
of cost $d + 2d$ which isn't $d + O(q^2)$.

Our revised algorithm does not contract edges initially. Instead, let us define a 
simple coordinate as one for which Split(i) succeeds. So, the first alteration we 
make to the algorithm is to call Split as long as the set of simple coordinates 
is sufficiently big. However, most alterations lie in the base of the recursion. 
Below we detail the algorithm and analyze its correctness. In the algorithm's 
description, for any coordinate i we denote the set of coordinates interchangeable 
with $i$ by $W_i$, and their number as $w(i) = |W_i|$.

Theorem 3. With probability $\geq 1/2$, the algorithm in Figure 2 returns a tree
of cost $d + O(q^2)$.

The proof of Theorem 3 follows the same outline as the proof of Theorem 2.
Observe that Lemmas 1 and 3 still hold. Therefore, with probability $\geq 1/2$, the
algorithm enters the base of the recursion with a proper partition. Thus, by the
following lemma, the algorithm outputs a tree of cost $d + O(q^2)$.

Lemma 4. Assume that the base of the recursion (i.e., Step 3) is called on a
proper partition $\mathcal{P}$ of the terminals over $d'$ non-constant coordinates. Then the
algorithm returns a forest of cost $d' + O(q^2)$.

The full proof of Lemma 4 is deferred to the appendix. However, let us sketch
the main outline of the proof. Recall the good path decomposition we used in
the proof of Lemma 3. We partition its paths in the following way.

2 Clearly, Split cannot abort now, but it might be the case that the algorithm picks
$i$ which is a bad coordinate. This can happen with probability $\leq q/(8q^2) = 1/(8q)$.



input: A partition $\mathcal{P}$ of current terminals. Initially, $\mathcal{P}$ is the singleton set $\mathcal{P} = \{C\}$.

1. if Pluck-a-leaf succeeds and returns $(x, \mathcal{P}')$

- recurse on $\mathcal{P}'$, then Paste-a-leaf $x$ and return the resulting forest.

2. else-if the number of simple coordinates on $\mathcal{P}$ is at least $8q^2$

- Pick a simple coordinate $i$ u.a.r. and invoke Split($i$).

- if Split succeeds: recurse on $\mathcal{P}'$, then Merge $x$ and $x^i$, and return the resulting
forest; otherwise fail.

3. else

- Contract all $W_i$ into $i$.

- For every $i$ with $w(i) > q$ and the (unique) component $P$ in which the $i$-cut
resides,

• Apply pattern matching to $(P, i)$. Let $(D^i, b^i)$ be the pattern of $i$.

• if $i$ is simple, split $P$ into $P_0 \cup \{x^i\}$ and $P_1 \cup \{\bar{x}^i\}$.

• else

* Define the node $y(i)$ as the node where $y(i)_i = 0$, every coordinate $j \in D^i$
is set to $b^i_j$, and every coordinate $j \notin D^i$ is set to 0.

* Define $\bar{y}(i)$ to be $y(i)$ with coordinate $i$ flipped.

* $\mathcal{P} = \mathcal{P} \setminus \{P\} \cup \{P_0 \cup \{y(i)\}\} \cup \{P_1 \cup \{\bar{y}(i)\}\}$.

- For every $P \in \mathcal{P}$ find its MST $T(P)$, and retrieve the forest $\{T(P)\}$.

- For every $i$ with $w(i) > q$:

if $i$ was simple, add an edge labeled $i$ between $x^i$ and $\bar{x}^i$
else add an edge labeled $i$ between $y(i)$ and $\bar{y}(i)$.

- Expand all contracted coordinates to their original set of coordinates by replacing
$i$ with a path of length $w(i)$. Return the resulting forest.



Fig. 2. Algorithm for the general case 

- Paths with at least one terminal on them. On such paths, because all interchangeable
coordinates may appear in $T$ in any order, all coordinates
on such paths are simple. So, when we enter the base of the recursion, there
are at most $8q^2$ edges on such paths.

- Paths with no terminal on them, with length $> q$. Such paths are composed
of interchangeable coordinates, and since there are more than $q$ of those, we
deduce all of them are good. Therefore, the endpoints of such paths are fixed
up to at most $q$ (bad) coordinates. We therefore contract these edges, split
on them, and introduce into each side of the cut an arbitrary endpoint, by
replacing non-fixed coordinates with zeros. So on each side of the cut the
cost of the subtree increases by at most $q$, and since there are at most $4q$ such
paths, our overall cost for introducing these artificial endpoints is $O(q^2)$.

- Paths with no terminal on them, with length $\leq q$. Such paths are composed
of interchangeable coordinates, but we do not contract them. Since there are
at most $4q \cdot q$ edges on such paths, we run the MST approximation, and incur
a cost of $O(q^2)$ for edges on such paths.

Runtime analysis: Pluck-a-leaf can be implemented in time linear in the
size of the dataset, i.e. $O(nd)$. Counting the number of simple coordinates takes
time $O(nd^2)$, and Split takes time $O(nd)$. A naive implementation of the base
case of the recursion takes time $O(nd^2)$ for contracting coordinates, and the rest
can be implemented in time $O(nd)$. Hence the time to process each node in the
recursion tree is at most $O(nd^2)$. Since there are at most $O(q)$ nodes in the
recursion tree, the total runtime is $O(qnd^2)$.

5 Discussion and Open Problems 

This paper presents a randomized approximation algorithm for constructing 
near-perfect phylogenies. In order to achieve this, we obtain a Steiner tree of
low additive error. However, from the biological perspective, the goal is to find 
a good evolutionary tree, one that will give correct answers to questions like 
"what is the common ancestor of the following species?" or "which of the two 
gene-mutations happened earlier?". Such questions, we hope, can be answered 
by finding the most-parsimonious phylogenetic tree over the given taxa. Hence, 
it is also desirable that any low-cost tree which we output also captures a lot of 
the structure of the optimal tree. 

We would like to point out that our algorithm in fact has this valuable
property. Notice that until the base case of the recursion, both Pluck-a-leaf 
and Split subroutines construct the optimal tree, and correctly identify the 
endpoints of the edges they remove. Even when the algorithm reaches the base 
case of the recursion, we can declare every edge of weight $> q$ to be good,
and we know its endpoints up to at most $q$ coordinates. In total, our algorithm
gives the structure of the optimal tree up to $O(q^2)$ edges, and those edges can
be marked as "unsure".

Several open problems remain for this work. The most straightforward one
is whether one can devise an algorithm outputting a phylogenetic tree of cost
$d + O(q)$. Alternatively, one may try to design exact algorithms that are efficient
even for $q = \omega(\log d)$. We suspect that even the case of $q = O((\log d)^2)$
poses quite a challenge. Finally, extending our results to non-binary alphabets
is intriguing. Note however that even the case of perfect phylogenies is NP-hard,
and tractable only for moderately sized alphabets. Furthermore, our bottom-up
approach completely breaks down for non-binary alphabets (see comment past
Lemma 1), so devising an additive-approximation algorithm for the phylogeny
problem with non-binary alphabets requires a different approach altogether.
A Appendix 



Here, we give short proofs of the basic properties presented in Fact 1. Many of
these results are also found in other work [2].

1. Let $S$ be a set of coordinates that define the same cut over the terminals (up
to renaming of $P_0$ and $P_1$). Then whenever any edge in $T$ is labeled with
some $i \in S$, the edge lies on a path on which each edge is labeled with
a unique coordinate from $S$, and no node of the $|S|$-long path is a terminal.
Proof. Assume $i, j \in S$ define the same cut over $C$. Given some edge where
$i$ flips, consider the shortest path from a terminal $t_1$ through to another
terminal $t_2$. If every node on this path is of degree 2, there is no terminal
before $t_2$, and $j$ flips before $t_2$. Now, suppose some node on the path between
$t_1$ and $t_2$ has degree more than 2. Either $j$ flips on each outgoing path before
any terminals, or $j$ does not define the same cut (since each outgoing path
has a terminal on it). But if $j$ flips on all outgoing paths before any terminal
occurs on those paths, relabel the Steiner nodes on each path so that $j$ is
constant along those outgoing paths. Then, add one Steiner node and an
edge from the endpoint of $e_i$ which flips $j$. This tree has cost strictly less
than the tree which flipped $j$ on each outgoing path, since there were at least
two paths, each of which flipped $j$.

□ 
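Property 1 concerns sets of coordinates that define the same cut over the terminals. As an illustration only, such sets can be read off the input directly. The following is a minimal Python sketch, assuming the terminals are given as equal-length 0/1 tuples; the function name `cut_groups` is hypothetical and does not appear in the paper.

```python
from collections import defaultdict


def cut_groups(terminals):
    """Group coordinates that induce the same cut over the terminals.

    terminals: list of equal-length tuples with entries 0 or 1.
    Returns a list of lists of coordinate indices. Constant coordinates
    are skipped because they do not separate any two terminals.
    """
    if not terminals:
        return []
    d = len(terminals[0])
    groups = defaultdict(list)
    for i in range(d):
        column = tuple(t[i] for t in terminals)
        if len(set(column)) < 2:
            continue  # constant coordinate: defines no cut
        # Normalize so that the first terminal always reads 0. A column and
        # its complement then share one key, which models "the same cut up
        # to renaming of the two sides".
        if column[0] == 1:
            column = tuple(1 - b for b in column)
        groups[column].append(i)
    return list(groups.values())


if __name__ == "__main__":
    example = [(0, 0, 1, 0), (1, 0, 0, 0), (1, 0, 0, 1)]
    print(cut_groups(example))
```

In the example, coordinates 0 and 2 are complements of each other over the three terminals, so they fall into one group, while coordinate 1 is constant and is dropped.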

2. For any two good coordinates i ≠ j, one side of the i-cut is contained 
within one side of the j-cut. Alternatively, there exist values b_j such that all 
terminals on one side of the i-cut have their jth coordinate set to b_j. 
Proof. Suppose i is good. Then, consider the i-cut in T. Since j is good, j 
may flip only once in the tree. If j flips in C_0, then j is constant in C_1. If j 
flips in C_1, then j is constant in C_0. 

□ 
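The containment in property 2 can be tested on the terminals alone: it is the familiar four-gamete condition for a pair of binary characters. The sketch below is illustrative and assumes terminals are 0/1 tuples; the names `cuts_are_nested` and `conflicting_pairs` are hypothetical. By property 2, any pair it reports as conflicting must contain at least one bad coordinate.

```python
from itertools import combinations


def cuts_are_nested(terminals, i, j):
    """Return True if one side of the i-cut lies inside one side of the j-cut.

    The side {t : t[i] == a} is contained in {t : t[j] == b} exactly when
    the pair (a, 1 - b) never occurs among the terminals, so containment
    holds for some (a, b) iff fewer than four value pairs are observed.
    """
    seen = {(t[i], t[j]) for t in terminals}
    return len(seen) < 4


def conflicting_pairs(terminals):
    """List all coordinate pairs (i, j), i < j, whose cuts are not nested."""
    if not terminals:
        return []
    d = len(terminals[0])
    return [(i, j) for i, j in combinations(range(d), 2)
            if not cuts_are_nested(terminals, i, j)]


if __name__ == "__main__":
    nested = [(0, 0), (0, 1), (1, 0)]
    crossing = nested + [(1, 1)]
    print(cuts_are_nested(nested, 0, 1))
    print(conflicting_pairs(crossing))
```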

3. Fix any good coordinate i and let j be a good coordinate such that all 
terminals on one side of the i-cut have their jth coordinate set to b_j. Then 
both endpoints of the edge labeled i have their jth coordinate set to b_j. 
Proof. Suppose that some endpoint of the edge on which i flips has j (which 
is constant on C_0) set to b_j. Then, since the edge allows only one coordinate 
to flip across it, both endpoints are labeled with j = b_j. Then, the side 
where j is constant (say C_0) has to pay for j. 

If j is constant and set to b_j on C_1, j has to flip twice, a contradiction since 
j is good. If j is constant and set to b_j on C_0, then the endpoints of i's edge 
are labeled by some constant setting of j. Assuming 
j is non-constant on C_1, j flips somewhere in C_1. But then, j flips twice, 
contradicting the fact that j is good. 

□ 

4. A good coordinate i and a bad coordinate i' cannot define the same cut. 
Proof. This follows directly from Fact 2.1, since i and i' will occur the same 
number of times in an optimal tree and a good coordinate occurs exactly 
once, while a bad coordinate occurs at least twice. 

□ 



B Proof of Lemma 5 



Here we give the full proof of Lemma 5. For convenience, we reiterate the lemma. 

Lemma 5. Assume that when step 3 is entered, P is a proper partition of the 
terminals over d' non-constant coordinates. Then step 3 of the algorithm returns 
a forest of cost d' + O(q^2). 

Proof. Recall that for any coordinate i, we define the weight of i as the total 
number of coordinates interchangeable with i. Let P^post denote the instance 
which results after splitting on all coordinates with weight greater than q. The 
proof of the lemma reduces to showing that there exists a forest over P^post of 
cost O(q^2). If such a forest exists, the MST-based 2-approximation for each 
component returns a forest of cost O(q^2), and reconstructing the tree with the 
plucked and split edges adds weight at most d'. So, the forest over P we get has 
cost at most d' + O(q^2). 
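The MST-based 2-approximation invoked above relies on a standard fact: a minimum spanning tree over the terminals, with pairwise Hamming distances as edge weights, costs at most twice the optimal Steiner tree. The sketch below illustrates that step only and is not the paper's implementation; it assumes one component whose terminals are 0/1 tuples, and the names `hamming` and `mst_over_terminals` are hypothetical. It uses a simple quadratic version of Prim's algorithm.

```python
def hamming(a, b):
    """Number of coordinates on which two equal-length tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def mst_over_terminals(terminals):
    """Prim's algorithm on the complete graph over the terminals.

    Edge weights are Hamming distances. Returns (total_cost, edges), where
    edges is a list of (parent_index, child_index) pairs. An MST edge of
    weight w stands for a path of w single-coordinate flips in the hypercube.
    """
    n = len(terminals)
    if n == 0:
        return 0, []
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known connection to the tree
    parent = [-1] * n
    best[0] = 0
    total, edges = 0, []
    for _ in range(n):
        # Pick the cheapest terminal that is not yet in the tree.
        u = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k])
        in_tree[u] = True
        total += best[u]
        if parent[u] >= 0:
            edges.append((parent[u], u))
        # Relax connections from u to the remaining terminals.
        for v in range(n):
            if not in_tree[v]:
                w = hamming(terminals[u], terminals[v])
                if w < best[v]:
                    best[v], parent[v] = w, u
    return total, edges


if __name__ == "__main__":
    star = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
    print(mst_over_terminals(star))
```

For the three terminals in the example, the spanning tree costs 4, whereas a Steiner tree through the all-zeros node costs 3, which is consistent with the factor-2 guarantee.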

The set of all terminals in all components of P^post is composed of terminals 
that belong to the original P, and the new terminals, added by coordinates of 
weight more than q which are not simple. We construct a forest of low cost over 
P^post by (i) taking an optimal Steiner forest T over P, (ii) for every simple 
coordinate which is split upon, removing the path of corresponding coordinates 
from T, and (iii) for every non-simple coordinate split upon, removing the path 
of corresponding coordinates and introducing the two vertices y(i) and ȳ(i), 
connecting each vertex to the endpoint of the path that resides on the same side 
of the cut. We denote the resulting forest by T^post, and show that T^post 
contains at most O(q^2) edges. 

First, observe that since there are at most q bad coordinates, by splitting 
only on coordinates with weight more than q, we are guaranteed to split only 
on good coordinates (we know from Fact 1 that the good and bad coordinates 
cannot define the same cut). Therefore, since P is proper, the coordinate i resides 
in a single component. Second, observe that the base of the recursion is executed 
only when Pluck-a-leaf fails. Therefore, as in the proof of Lemma 3, T is a 
forest with at most 2q leaves, so its path decomposition contains at most 4q 
paths. Our next observation is the following proposition. 

Proposition 1. Fix a path in the path decomposition of the forest. If there exists 
some non-endpoint terminal on this path, all coordinates on this path are simple. 

Proof. Let i be a coordinate associated with an edge on such a path. Let x^i 
be a terminal on the path which is the closest to i. Recall the observation from 
before, that any two coordinates that yield the same terminal cut must appear in 
the optimal tree adjacent to one another, and furthermore, their order does not 
matter (see Fact 1). Therefore, we may assume i is adjacent to x^i. Furthermore, 
as x^i is not an endpoint, there exists a good coordinate j ≠ i adjacent to x^i 
on this path. It follows that i and j determine two different cuts over the set of 
terminals. Then, just as in the proof of Lemma 3, i is simple, where x^i is the 
unique terminal on one side of its cut matching i's pattern. 



□ 

Proposition 1 means that any non-simple coordinate corresponds to an entire 
path in the path decomposition (because all edges on a path with no terminals 
on it determine the same cut). We deduce that the number of non-simple 
contracted coordinates of weight more than q is at most 8q. Furthermore, the 
number of edges on paths of length no more than q with no terminal on them 
is at most 8q^2. Finally, since the base of the recursion is executed only when 
Split fails to execute, the number of edges on paths that do have a terminal on 
them is at most 16q^2. It follows that the path decomposition of T contains at 
most 16q^2 edges, in addition to the edges on paths of length more than q with 
no terminal on them (and each such path causes the algorithm to split exactly 
once). 

We can now upper bound the number of edges in T^post. The forest T^post 
contains at most 2q ≤ 2q^2 bad edges; at most 16q^2 edges from T; and all the paths 
between the vertices we introduce by adding the y(i)-vertices and connecting 
them to T. Let i be a non-simple contracted coordinate which was split upon, 
and let u → v be the corresponding path on T which was removed. Assume y(i) 
resides on the same side of the cut as u. Clearly, u and y(i) agree on all 
coordinates on the u → v path, but they also agree on all other good coordinates: 
a good coordinate j appears at most once in T, so u, v and all terminals on at least 
one side of the u → v cut have the same value on their jth coordinate. It follows 
that the path between y(i) and u is of length at most q, and the same applies to 
the path between ȳ(i) and v. Therefore, the number of edges on the paths we have 
yet to upper bound sums to no more than 8q · 2 · q = 16q^2. Thus T^post contains 
no more than 34q^2 edges. This completes the proof. 

□ 
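For reference, the final count can be written out term by term. The display below only restates the three contributions named in the proof, under the assumption q ≥ 1 (so that 2q ≤ 2q^2); it adds no new bound.

```latex
\begin{align*}
|E(\mathcal{T}^{\mathrm{post}})|
  &\le \underbrace{2q^2}_{\text{bad edges, using } 2q \le 2q^2}
   + \underbrace{16q^2}_{\text{edges kept from } \mathcal{T}}
   + \underbrace{8q \cdot 2 \cdot q}_{\text{paths to the introduced vertices}} \\
  &= 2q^2 + 16q^2 + 16q^2 \;=\; 34q^2 .
\end{align*}
```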



